Abstract

In order to optimize dataframe:

Downcasting numeric columns

df_num = df.select_dtypes(include=['int64','float64'])
converted_num = df_num.apply(pd.to_numeric,downcast='unsigned')

Converting string column to categorical type

When we convert columns to category, it's important to be aware of trade-off:

  • we can't perform numerical computation on category
  • we should use the category type primarily for object column where less than 50% of the values are unique
gl_obj = gl.select_dtypes(include=['object']).copy()

converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]

Converting date

df['date'] = pd.to_datetime(date,format='%Y%m%d')

Specify type when load data

dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes_col = dtypes.index
dtypes_type=[i.name for i in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
read_and_optimized = pd.read_csv('..\data\game_logs.csv',dtype=column_types,parse_dates=['date'],infer_datetime_format=True)

Tip to Reduce memory usage


In [9]:
import os
import pandas as pd

In [10]:
# Load Data
gl = pd.read_csv('..\data\game_logs.csv')
# Available also at https://data.world/dataquest/mlb-game-logs


C:\anaconda\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [11]:
# Data Preview
gl.head()


Out[11]:
date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number v_score ... h_player_7_name h_player_7_def_pos h_player_8_id h_player_8_name h_player_8_def_pos h_player_9_id h_player_9_name h_player_9_def_pos additional_info acquisition_info
0 18710504 0 Thu CL1 na 1 FW1 na 1 0 ... Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 18710505 0 Fri BS1 na 1 WS3 na 1 20 ... Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 18710506 0 Sat CL1 na 2 RC1 na 1 12 ... Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 18710508 0 Mon CL1 na 3 CH1 na 1 12 ... Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 18710509 0 Tue BS1 na 2 TRO na 1 9 ... Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

5 rows × 161 columns


In [14]:
gl.dtypes.head()


Out[14]:
date               int64
number_of_game     int64
day_of_week       object
v_name            object
v_league          object
dtype: object

In [37]:
# Select only the column with same type
gl.select_dtypes(include=['object']).head()


Out[37]:
day_of_week v_name v_league h_name h_league day_night completion forefeit protest park_id ... h_player_6_id h_player_6_name h_player_7_id h_player_7_name h_player_8_id h_player_8_name h_player_9_id h_player_9_name additional_info acquisition_info
0 Thu CL1 na FW1 na D NaN NaN NaN FOR01 ... caret101 Tom Carey mince101 Ed Mincher mcdej101 James McDermott kellb105 Bill Kelly NaN Y
1 Fri BS1 na WS3 na D NaN NaN NaN WAS01 ... leona101 Andy Leonard braia102 Asa Brainard burrh101 Henry Burroughs berth101 Henry Berthrong HTBF Y
2 Sat CL1 na RC1 na D NaN NaN NaN RCK01 ... ansoc101 Cap Anson sagep101 Pony Sager birdg101 George Bird stirg101 Gat Stires NaN Y
3 Mon CL1 na CH1 na D NaN NaN NaN CHI01 ... folet101 Tom Foley duffe101 Ed Duffy pinke101 Ed Pinkham zettg101 George Zettlein NaN Y
4 Tue BS1 na TRO na D NaN NaN NaN TRO01 ... beave101 Edward Beavens bells101 Steve Bellan pikel101 Lip Pike cravb101 Bill Craver HTBF Y

5 rows × 78 columns


In [16]:
#Exact amount of memory usage of df
gl.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB

First Look


In [18]:
gl.describe()


Out[18]:
date number_of_game v_game_number h_game_number v_score h_score length_outs attendance length_minutes v_at_bats ... v_player_9_def_pos h_player_1_def_pos h_player_2_def_pos h_player_3_def_pos h_player_4_def_pos h_player_5_def_pos h_player_6_def_pos h_player_7_def_pos h_player_8_def_pos h_player_9_def_pos
count 1.719070e+05 171907.000000 171907.000000 171907.000000 171907.000000 171907.000000 140841.000000 118877.000000 136701.000000 140838.000000 ... 140835.000000 140838.000000 140838.000000 140838.000000 140838.000000 140838.000000 140838.000000 140838.000000 140838.000000 140838.000000
mean 1.953462e+07 0.260897 76.929887 76.953806 4.420582 4.701461 53.619976 20184.247188 150.903329 34.914398 ... 1.894778 6.462560 5.914767 6.435770 5.922982 5.786144 5.497628 4.956184 4.080078 1.895873
std 4.149326e+05 0.605667 45.178029 45.162564 3.278489 3.355605 5.571512 14257.381902 34.748160 4.633761 ... 1.939072 1.813946 1.946100 2.306364 2.529624 2.521525 2.446669 2.264262 2.167098 1.936534
min 1.871050e+07 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 24.000000 12.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 1.918052e+07 0.000000 38.000000 38.000000 2.000000 2.000000 51.000000 7962.000000 125.000000 32.000000 ... 1.000000 5.000000 4.000000 4.000000 3.000000 3.000000 3.000000 3.000000 2.000000 1.000000
50% 1.953053e+07 0.000000 76.000000 76.000000 4.000000 4.000000 54.000000 18639.000000 150.000000 34.000000 ... 1.000000 7.000000 6.000000 7.000000 7.000000 6.000000 5.000000 5.000000 4.000000 1.000000
75% 1.989051e+07 0.000000 115.000000 115.000000 6.000000 6.000000 54.000000 31242.000000 173.000000 37.000000 ... 1.000000 8.000000 8.000000 8.000000 8.000000 8.000000 8.000000 6.000000 6.000000 1.000000
max 2.016100e+07 3.000000 165.000000 165.000000 49.000000 38.000000 156.000000 99027.000000 1245.000000 85.000000 ... 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000

8 rows × 83 columns


In [24]:
# Reference http://www.markhneedham.com/blog/2017/07/05/pandas-find-rows-where-columnfield-is-null/
# Columns with null values

null_columns=gl.columns[gl.isnull().any()]
gl[null_columns].isnull().sum()


Out[24]:
length_outs                      31066
day_night                        31757
completion                      171791
forefeit                        171762
protest                         171727
attendance                       53030
length_minutes                   35206
v_line_score                     24636
h_line_score                     24636
v_at_bats                        31069
v_hits                           31069
v_doubles                        31069
v_triples                        31069
v_homeruns                       31069
v_rbi                            31069
v_sacrifice_hits                 31069
v_sacrifice_flies                31069
v_hit_by_pitch                   31069
v_walks                          31069
v_intentional walks              31069
v_strikeouts                     31069
v_stolen_bases                   31069
v_caught_stealing                31069
v_grounded_into_double           31069
v_first_catcher_interference     31069
v_left_on_base                   31069
v_pitchers_used                  31069
v_individual_earned_runs         31069
v_team_earned_runs               31069
v_wild_pitches                   31069
                                 ...  
v_player_9_def_pos               31072
h_player_1_id                    31069
h_player_1_name                  31069
h_player_1_def_pos               31069
h_player_2_id                    31069
h_player_2_name                  31069
h_player_2_def_pos               31069
h_player_3_id                    31069
h_player_3_name                  31069
h_player_3_def_pos               31069
h_player_4_id                    31069
h_player_4_name                  31069
h_player_4_def_pos               31069
h_player_5_id                    31069
h_player_5_name                  31069
h_player_5_def_pos               31069
h_player_6_id                    31069
h_player_6_name                  31069
h_player_6_def_pos               31069
h_player_7_id                    31069
h_player_7_name                  31069
h_player_7_def_pos               31069
h_player_8_id                    31069
h_player_8_name                  31069
h_player_8_def_pos               31069
h_player_9_id                    31069
h_player_9_name                  31069
h_player_9_def_pos               31069
additional_info                 170451
acquisition_info                 31066
dtype: int64

In [26]:
# Every row that contains at least one null value
print(gl[gl.isnull().any(axis=1)][null_columns].head())


   length_outs day_night completion forefeit protest  attendance  \
0         54.0         D        NaN      NaN     NaN       200.0   
1         54.0         D        NaN      NaN     NaN      5000.0   
2         54.0         D        NaN      NaN     NaN      1000.0   
3         54.0         D        NaN      NaN     NaN      5000.0   
4         54.0         D        NaN      NaN     NaN      3250.0   

   length_minutes v_line_score h_line_score  v_at_bats        ...         \
0           120.0    000000000    010010000       30.0        ...          
1           145.0    107000435    640113030       41.0        ...          
2           140.0    610020003    010020100       49.0        ...          
3           150.0    101403111    077000000       46.0        ...          
4           145.0    000002232    101003000       46.0        ...          

   h_player_7_name  h_player_7_def_pos  h_player_8_id  h_player_8_name  \
0       Ed Mincher                 7.0       mcdej101  James McDermott   
1     Asa Brainard                 1.0       burrh101  Henry Burroughs   
2       Pony Sager                 6.0       birdg101      George Bird   
3         Ed Duffy                 6.0       pinke101       Ed Pinkham   
4     Steve Bellan                 5.0       pikel101         Lip Pike   

   h_player_8_def_pos  h_player_9_id  h_player_9_name  h_player_9_def_pos  \
0                 8.0       kellb105       Bill Kelly                 9.0   
1                 9.0       berth101  Henry Berthrong                 8.0   
2                 7.0       stirg101       Gat Stires                 9.0   
3                 5.0       zettg101  George Zettlein                 1.0   
4                 3.0       cravb101      Bill Craver                 6.0   

   additional_info  acquisition_info  
0              NaN                 Y  
1             HTBF                 Y  
2              NaN                 Y  
3              NaN                 Y  
4             HTBF                 Y  

[5 rows x 145 columns]

Under the hood, pandas groups the columns into block of values of the same type:

  • ObjectBlock, contains string
  • FloatBlock (ndarray)
  • IntBlock (ndarray)

Because each data type is stored separately, we examine the memory usage by data type.

Average memory usage for data type


In [62]:
gl.dtypes.value_counts()


Out[62]:
object     78
float64    77
int64       6
dtype: int64

In [60]:
for dtype in gl.dtypes.unique(): #['float','int64','object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))


Average memory usage for int64 columns: 1.12 MB
Average memory usage for object columns: 9.51 MB
Average memory usage for float64 columns: 1.29 MB

Subtype

Under the hood pandas represents numeric values as NumPy ndarrays and stores them in a continuous block of memory. This approach:

  • consumes less space
  • allows us to access quickly

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes.

To discovare the range of values of given dtype we can use https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.iinfo.html


In [63]:
import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))


Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Optimize Numeric Columns with subtypes

We can use the function pd.to_numeric() to downcast our numeric types


In [64]:
# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

In [65]:
mem_usage(gl)


Out[65]:
'860.50 MB'

In [68]:
gl_int = gl.select_dtypes(include=['int64'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')

In [69]:
print(mem_usage(gl_int))
print(mem_usage(converted_int))


7.87 MB
1.48 MB

In [70]:
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)


Out[70]:
before after
uint8 NaN 5.0
uint32 NaN 1.0
int64 6.0 NaN

We can see a drop from 7.9 to 1.5 MB. Now we have 5 uint8 and 1 unit32 instead of 6 int64.

Lets do the same thing with float columns


In [72]:
gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')

print(mem_usage(gl_float))
print(mem_usage(converted_float))

compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)


100.99 MB
50.49 MB
Out[72]:
before after
float32 NaN 77.0
float64 77.0 NaN

All our float columns were converted from float64 to float32, give us a 50 reduction in memory usage.

We apply this optimizations on the entire df


In [73]:
optimized_gl = gl.copy()

optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float

print(mem_usage(gl))
print(mem_usage(optimized_gl))


860.50 MB
803.61 MB

In order to have the best benefit, we have to optimize the object types.

Comparing Numeric and String Storage

The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn’t have fine grained-control over how values in memory are stored.

This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the “address” for the actual value’s location in memory.

Object types as using a variable amount of memory. While each pointer takes up 1 byte of memory, each actual string value uses the same amount of memory that string would use if stored individually in Python. Let’s use sys.getsizeof() to prove that out, first by looking at individual strings, and then items in a pandas series.


In [74]:
from sys import getsizeof

s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'
          
for s in [s1, s2, s3, s4]:
    print(getsizeof(s))


60
65
74
74

In [76]:
obj_series = pd.Series(['working out',
                          'memory usage for',
                          'strings in python is fun!',
                          'strings in python is fun!'])
obj_series.apply(getsizeof)


Out[76]:
0    60
1    65
2    74
3    74
dtype: int64

You can see that the size of strings when stored in a pandas series are identical to their usage as separate strings in Python.

Optimizing object types using categoricals

The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space efficient int subtype that can represent all of the unique values in a column.


In [78]:
#Where we migh be able to reduce memory?
gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()


Out[78]:
day_of_week v_name v_league h_name h_league day_night completion forefeit protest park_id ... h_player_6_id h_player_6_name h_player_7_id h_player_7_name h_player_8_id h_player_8_name h_player_9_id h_player_9_name additional_info acquisition_info
count 171907 171907 171907 171907 171907 140150 116 145 180 171907 ... 140838 140838 140838 140838 140838 140838 140838 140838 1456 140841
unique 7 148 7 148 7 2 116 3 5 245 ... 4774 4720 5253 5197 4760 4710 5193 5142 332 1
top Sat CHN NL CHN NL D 20010616,,0,0,6 H V STL07 ... grimc101 Charlie Grimm grimc101 Charlie Grimm lopea102 Al Lopez spahw101 Warren Spahn HTBF Y
freq 28891 8870 88866 9024 88867 82724 1 69 90 7022 ... 427 427 491 491 676 676 339 339 1112 140841

4 rows × 78 columns

A quick glance reveals many columns where there are few unique values relative to the overall ~172,000 games in our data set. We start to convert day_of_week to category using the .astype(method).


In [79]:
dow = gl_obj.day_of_week
print(dow.head())

dow_cat = dow.astype('category')
print(dow_cat.head())


0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: object
0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]

In [80]:
# We can see the integer values associated to column
dow_cat.head().cat.codes


Out[80]:
0    4
1    0
2    2
3    1
4    5
dtype: int8

In [81]:
# We compare the memory usage  
print(mem_usage(dow))
print(mem_usage(dow_cat))


9.84 MB
0.16 MB

When we convert columns to category, it's important to be aware of trade-off:

  • we can't perform numerical computation on category
  • we should use the category type primarily for object column where less than 50% of the values are unique

In [82]:
converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]

In [83]:
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))

compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)


751.64 MB
42.49 MB
Exception ignored in: 'pandas.lib.array_equivalent_object'
TypeError: data type not understood
Out[83]:
before after
object 78.0 NaN
category NaN 78.0

In [84]:
# Now we combine with the rest of our dataframe (numeric columns)
optimized_gl[converted_obj.columns] = converted_obj

mem_usage(optimized_gl)


Out[84]:
'94.46 MB'

Convert date


In [85]:
date = optimized_gl.date
print(mem_usage(date))
date.head()


0.66 MB
Out[85]:
0    18710504
1    18710505
2    18710506
3    18710508
4    18710509
Name: date, dtype: uint32

We’ll convert using pandas.to_datetime() function, using the format parameter to tell it that our date data is stored YYYY-MM-DD.


In [86]:
optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')

print(mem_usage(optimized_gl))
optimized_gl.date.head()


95.11 MB
Out[86]:
0   1871-05-04
1   1871-05-05
2   1871-05-06
3   1871-05-08
4   1871-05-09
Name: date, dtype: datetime64[ns]

Selecting Types While Reading the Data In

we can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.


In [93]:
dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes.head()


Out[93]:
number_of_game       uint8
day_of_week       category
v_name            category
v_league          category
v_game_number        uint8
dtype: object

In [94]:
dtypes_col = dtypes.index
dtypes_col


Out[94]:
Index(['number_of_game', 'day_of_week', 'v_name', 'v_league', 'v_game_number',
       'h_name', 'h_league', 'h_game_number', 'v_score', 'h_score',
       ...
       'h_player_7_name', 'h_player_7_def_pos', 'h_player_8_id',
       'h_player_8_name', 'h_player_8_def_pos', 'h_player_9_id',
       'h_player_9_name', 'h_player_9_def_pos', 'additional_info',
       'acquisition_info'],
      dtype='object', length=160)

In [103]:
dtypes_type=[i.name for i in dtypes.values]

In [104]:
column_types = dict(zip(dtypes_col, dtypes_type))

In [117]:
#Preview of first 10
{k:v for k,v in list(column_types.items())[:10]}


Out[117]:
{'day_of_week': 'category',
 'h_game_number': 'uint8',
 'h_league': 'category',
 'h_name': 'category',
 'h_score': 'uint8',
 'number_of_game': 'uint8',
 'v_game_number': 'uint8',
 'v_league': 'category',
 'v_name': 'category',
 'v_score': 'uint8'}

Now we can use the dictionary, along with a few parameters for the date to read in the data with the correct types in a few lines:


In [119]:
read_and_optimized = pd.read_csv('..\data\game_logs.csv',dtype=column_types,parse_dates=['date'],infer_datetime_format=True)

print(mem_usage(read_and_optimized))
read_and_optimized.head()


95.10 MB
Out[119]:
date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number v_score ... h_player_7_name h_player_7_def_pos h_player_8_id h_player_8_name h_player_8_def_pos h_player_9_id h_player_9_name h_player_9_def_pos additional_info acquisition_info
0 1871-05-04 0 Thu CL1 na 1 FW1 na 1 0 ... Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 1871-05-05 0 Fri BS1 na 1 WS3 na 1 20 ... Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 1871-05-06 0 Sat CL1 na 2 RC1 na 1 12 ... Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 1871-05-08 0 Mon CL1 na 3 CH1 na 1 12 ... Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 1871-05-09 0 Tue BS1 na 2 TRO na 1 9 ... Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

5 rows × 161 columns

Analyzing baseball games

how game length has varied over the years


In [136]:
import matplotlib.pyplot as plt
optimized_gl['year'] = optimized_gl.date.dt.year
game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year','length_minutes')
plt.show()